part-1

• DOMAIN: Automobile

• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multi-valued discrete and 5 continuous attributes.

• DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon

• Attribute Information:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

• PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models to predict ‘mpg’.

Steps and tasks:

  1. Import and warehouse data:

• Import all the given datasets and explore their shape and size.
• Merge all datasets into one and explore the final shape and size.
• Export the final dataset and store it on the local machine in .csv, .xlsx and .json formats for future use.
• Import the data from the above steps into Python.
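The import/merge/export steps above can be sketched as follows. This is a minimal sketch using two hypothetical fragments of the auto-mpg data that share a "car name" key; the actual files supplied with the project should be substituted.

```python
# Minimal sketch, assuming two hypothetical fragments of the auto-mpg data that
# share a "car name" key; substitute the actual files supplied with the project.
import pandas as pd

specs = pd.DataFrame({
    "car name": ["chevrolet chevelle", "buick skylark"],
    "cylinders": [8, 8],
    "weight": [3504, 3693],
})
targets = pd.DataFrame({
    "car name": ["chevrolet chevelle", "buick skylark"],
    "mpg": [18.0, 15.0],
})

# Merge all fragments into one dataset and inspect the final shape.
merged = pd.merge(specs, targets, on="car name")
print(merged.shape)  # (2, 4)

# Export for future use; .xlsx export additionally needs openpyxl installed.
merged.to_csv("auto_mpg_final.csv", index=False)
merged.to_json("auto_mpg_final.json", orient="records")

# Re-import the warehoused copy.
df = pd.read_csv("auto_mpg_final.csv")
```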

  2. Data cleansing:

• Missing/incorrect value treatment.
• Drop attribute(s) if required, using relevant functional knowledge.
• Perform any other corrections/treatments on the data.
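A sketch of the missing-value treatment, assuming the well-known quirk of the classic auto-mpg data where "horsepower" holds "?" strings in place of missing entries (the values below are toy data):

```python
# Sketch of missing/incorrect value treatment, assuming the classic auto-mpg
# quirk that "horsepower" holds "?" strings for missing entries; toy values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": ["130", "?", "150"], "mpg": [18.0, 25.0, 16.0]})

# Replace the sentinel, coerce to numeric, then impute with the median.
df["horsepower"] = pd.to_numeric(df["horsepower"].replace("?", np.nan))
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())
print(df["horsepower"].tolist())  # [130.0, 140.0, 150.0]
```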

  3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis, with appropriate detailed comments after each analysis. Hint: use your best analytical approach. You can even mix and match columns to create new ones that enable better analysis, and create your own features if required. Be highly experimental and analytical here to find hidden patterns.
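One example of the "create your own features" hint: a hypothetical power-to-weight ratio, a derived column that often tracks fuel economy more tightly than either source column alone (toy values below):

```python
# Hypothetical engineered feature: power-to-weight ratio, on toy values.
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, 165.0], "weight": [3504, 3693]})
df["power_to_weight"] = df["horsepower"] / df["weight"]
print(df["power_to_weight"].round(4).tolist())  # [0.0371, 0.0447]
```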

  4. Machine learning:

• Use K-Means and hierarchical clustering to find the optimal number of clusters in the data.
• Share your insights about the differences between these two methods.
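The cluster search can be sketched as below: an elbow scan with K-Means plus an agglomerative fit, on toy 2-D data standing in for the scaled mpg features.

```python
# Sketch of the elbow search with K-Means plus an agglomerative fit, on toy
# 2-D data standing in for the scaled mpg features.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# Inertia drops sharply up to the true k, then flattens (the "elbow").
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 6)}

# Hierarchical clustering needs no k up front when cutting a dendrogram,
# though sklearn's API takes n_clusters directly.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(silhouette_score(X, labels))
```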

  5. Answer the questions below based on the outcomes of the ML-based methods.

• Mention how many optimal clusters are present in the data and the possible reason behind it.
• Fit a linear regression model on each cluster separately and print the coefficients of each model individually.
• How is using different models for different clusters helpful in this case, and how does it differ from using one single model without clustering? Mention how it impacts performance and prediction.
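The per-cluster regression step can be sketched as below, on synthetic data with two linear regimes so the printed coefficients differ visibly between clusters:

```python
# Sketch: fit one LinearRegression per K-Means cluster and print each model's
# coefficients, on synthetic data with two regimes (slopes 2 and -3).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 1)), rng.normal(10, 1, (60, 1))])
y = np.where(X[:, 0] < 5, 2 * X[:, 0], -3 * X[:, 0] + 50)

clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
coefs = {}
for c in np.unique(clusters):
    m = LinearRegression().fit(X[clusters == c], y[clusters == c])
    coefs[c] = m.coef_[0]
    print(f"cluster {c}: coef={m.coef_[0]:.2f}, intercept={m.intercept_:.2f}")
```

A single global model would be forced to average the two slopes, which is exactly why per-cluster models can predict better when the data contains distinct regimes.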

  6. Improvisation:

• Detailed suggestions or improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to enable better data analysis in future.

Import and warehouse data

Data cleansing

Data analysis & visualisation

EDA

univariate, bivariate and multivariate analysis

Hierarchical Clustering

K-Means Clustering

Linear regression on the original dataset

Linear regression on data with K means cluster

Linear regression on data with H-clusters

Improvisation

K-Means appears to explain the highest variation in the dataset, but the difference is only about 1% compared with the other models. For more clarity, a larger dataset may be used; with the features mentioned above it may be possible to obtain higher accuracy or better explainability from the models and their variables.

part-2

• DOMAIN: Manufacturing

• CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

• DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.

Attribute Information:

  1. A, B, C, D: specific chemical composition measure of the wine

  2. Quality: quality of wine [ Low and High ]

• PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.

Steps and tasks:

  1. Design a synthetic data generation model which can impute values [Attribute: Quality] wherever the company has missed recording the data.

There appears to be no misclassification when checking the predicted clusters against the non-missing target values; hence the new labels can be used as the target variable.
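The imputation idea described above can be sketched as follows: cluster on the chemical measures, map each cluster to the majority Quality label among the non-missing rows, and use that label wherever Quality is empty. The column names A–D follow the data description; the values here are synthetic.

```python
# Sketch: cluster-based imputation of the Quality label. Columns A-D follow
# the data description; all values here are synthetic.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
low = rng.normal(0, 0.2, (30, 4))
high = rng.normal(2, 0.2, (30, 4))
df = pd.DataFrame(np.vstack([low, high]), columns=list("ABCD"))
df["Quality"] = ["Low"] * 30 + ["High"] * 30
df.loc[5, "Quality"] = None   # simulate unrecorded values
df.loc[45, "Quality"] = None

# Cluster on the chemical measures only.
df["cluster"] = KMeans(n_clusters=2, n_init=10,
                       random_state=2).fit_predict(df[list("ABCD")])

# Majority label per cluster among the non-missing rows, then fill the gaps.
mapping = (df.dropna(subset=["Quality"])
             .groupby("cluster")["Quality"]
             .agg(lambda s: s.mode()[0]))
df["Quality"] = df["Quality"].fillna(df["cluster"].map(mapping))
print(df.loc[[5, 45], "Quality"].tolist())  # ['Low', 'High']
```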

part-3

• DOMAIN: Automobile

• CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette.

The vehicle may be viewed from one of many different angles.

• DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

• All the features are numeric, i.e. geometric features extracted from the silhouette.

• PROJECT OBJECTIVE: Apply a dimensionality reduction technique – PCA – and train a model using principal components instead of training the model on just the raw data.

Steps and tasks:

  1. Data: Import, clean and pre-process the data

  2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns.

  3. Classifier: Design and train a best-fit SVM classifier using all the data attributes.

  4. Dimensionality reduction: perform dimensionality reduction on the data.

  5. Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.

  6. Conclusion: Showcase key pointers on how dimensionality reduction helped in this case.
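Steps 3–5 can be sketched as below on a stand-in dataset (sklearn's wine data, not the vehicle silhouettes): one SVM on the raw scaled features versus the same SVM on PCA components retaining 95% of the variance.

```python
# Sketch of steps 3-5 on a stand-in dataset (sklearn's wine data): an SVM on
# raw scaled features vs. the same SVM on PCA components keeping 95% variance.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

raw = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC()).fit(X_tr, y_tr)
print(raw.score(X_te, y_te), pca.score(X_te, y_te))
```

The PCA pipeline typically gives comparable accuracy with far fewer inputs to the SVM, which is the key pointer step 6 asks for.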

DATA IMPORT AND CLEAN

There appear to be a few missing values.

EDA using univariate, bi-variate and multivariate analysis

PCA

SVM

Conclusion

1) The SVM model trained on the raw attributes reached 98% train/test accuracy. 2) The SVM model trained on PCA components reached 94%. Both models scored above 90% accuracy on the test data, and both fit the data well.

PART-4

Import the warehoused data

EDA and visualisation univariate, bi-variate and multivariate

There appear to be outliers; we will not treat them, as it is highly likely that these are genuine observations.

From the IPL series data: identifying the Grade A and Grade B batsmen, and finding their runs, strike rate, fours, sixes and half-centuries.

PART-5

  1. List down all possible dimensionality reduction techniques that can be implemented using python.
  2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

List of all possible dimensionality reduction techniques and when to use them

Dimensionality reduction techniques can be classified into 3 types:

Feature selection:

A) Missing value ratio: if the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables that have a large number of missing values.
B) Low variance filter: we apply this approach to identify and drop constant variables from the dataset. The target variable is not unduly affected by variables with low variance, so these variables can be safely dropped.
C) High correlation filter: a pair of variables with high correlation increases multicollinearity in the dataset, so we can use this technique to find highly correlated features and drop them accordingly.
D) Random forest: one of the most commonly used techniques, it tells us the importance of each feature present in the dataset. We can compute the importance of each feature and keep only the top features, resulting in dimensionality reduction.
E) Backward feature elimination and forward feature selection: both techniques take a lot of computational time and are thus generally used on smaller datasets.
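Filters A)–C) can be sketched in a few lines of pandas on a small hypothetical frame (the column names are illustrative only):

```python
# Tiny sketch of the missing-value-ratio, low-variance (constant) and
# high-correlation filters, on a hypothetical frame.
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [1.0, None, None, None],
    "constant": [5, 5, 5, 5],
    "x": [1, 2, 3, 4],
    "x_copy": [2, 4, 6, 8],  # perfectly correlated with x
})

df = df.loc[:, df.isna().mean() < 0.5]   # A) missing value ratio filter
df = df.loc[:, df.nunique() > 1]         # B) drop constant columns

# C) drop one column of each highly correlated pair.
corr = df.corr().abs()
drop = [c for i, c in enumerate(corr.columns)
        if any(corr.iloc[j, i] > 0.95 for j in range(i))]
df = df.drop(columns=drop)
print(df.columns.tolist())  # ['x']
```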

Components / Factor Based:

A) Factor analysis: this technique is best suited for situations where we have a highly correlated set of variables. It divides the variables into groups based on their correlations and represents each group with a factor.
B) Principal component analysis: one of the most widely used techniques for dealing with linear data. It transforms the data into a set of components that try to explain as much variance as possible.
C) Independent component analysis: we can use ICA to transform the data into statistically independent components, describing the data with fewer components.
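A minimal contrast of PCA and ICA on the same toy data: PCA orders components by explained variance, while ICA attempts to recover statistically independent sources (the sources and mixing matrix below are made up for illustration):

```python
# Minimal contrast of PCA and ICA on synthetic mixed signals.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(3)
S = np.c_[np.sin(np.linspace(0, 8, 500)), rng.laplace(size=500)]  # two sources
X = S @ np.array([[1.0, 0.5], [0.5, 2.0]])                        # mixed signals

pca = PCA(n_components=2).fit(X)          # components ordered by variance
ica = FastICA(n_components=2, random_state=3).fit(X)  # independent sources
print(pca.explained_variance_ratio_.round(2))
```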

Projection Based:

A) ISOMAP: We use this technique when the data is strongly non-linear.
B) t-SNE: This technique also works well when the data is strongly non-linear. It works extremely well for visualizations as well.
C) UMAP: this technique works well for high-dimensional data, and its run time is shorter than t-SNE's.
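The two sklearn-native projections above can be sketched on a slice of the digits data (UMAP needs the external umap-learn package, so it is omitted here):

```python
# Sketch of the non-linear projections on a slice of the digits data; t-SNE is
# used purely for a 2-D embedding, as in a visualization.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:200]  # keep the run short

X_iso = Isomap(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, init="pca", random_state=0,
              perplexity=30).fit_transform(X)
print(X_iso.shape, X_tsne.shape)  # (200, 2) (200, 2)
```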

from sklearn.datasets import load_digits

digits = load_digits()
digits.images.shape  # (1797, 8, 8)
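Continuing the digits example above, question 2 can be illustrated directly: PCA on the flattened 8x8 images, plus TruncatedSVD (latent semantic analysis) on a tiny made-up text corpus, showing that the same ideas apply to multimedia and text once the data is turned into a numeric matrix.

```python
# Dimensionality reduction on image data (PCA on flattened digits) and on
# text data (TruncatedSVD / LSA on a tiny made-up corpus).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

digits = load_digits()
X_img = digits.images.reshape(len(digits.images), -1)   # (1797, 64)
X_img_reduced = PCA(n_components=16).fit_transform(X_img)
print(X_img_reduced.shape)  # (1797, 16)

corpus = ["red wine from the vineyard",
          "white wine quality",
          "vehicle silhouette features"]
X_txt = TfidfVectorizer().fit_transform(corpus)         # sparse tf-idf matrix
X_txt_reduced = TruncatedSVD(n_components=2).fit_transform(X_txt)
print(X_txt_reduced.shape)  # (3, 2)
```

For video, the same approach applies frame by frame: each frame is flattened to a row vector and the resulting matrix is reduced exactly as the image matrix here.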